Script description:

The aim of this script is to perform a statistical analysis of untargeted metabolomics data which has been previously pre-treated with the script “Prepare_Datasets_to_Statistical_Analysis.Rmd”. This performs several tasks to achieve this end:

  1. Read the raw data files from the input folder.

  2. Normalize data set with the package “NormalyzerDE” without the need of a design input (it is created with the information given in the input files).

  3. After the review and evaluation of the normalization report by the researcher, the selected data normalized is uploaded by changing line 178 with the corresponding file. You can choose among the following options: CycLoess-normalized.txt, GI-normalized.txt, log2-normalized.txt, mean-normalized.txt, median-normalized.txt, Quantile-normalized.txt, RLR-normalized.txt, VSN-normalized.txt. If you do not want to use any normalized file, then set it as “None”.

  4. Data visualization before and after the normalization with box plot from “NormalyMets” package.

  5. Principal Component Analysis (PCA) and Hierarchical Cluster Analysis (HCA) from packages “FactoMineR” and “factoextra” for normalized dataset to check the right group differentiation and a possible batch effect.

  6. Statistical analysis, using the “Limma” package, performed to extract the p-values and log2 fold change values (LFC) for every metabolite in the conducted comparative analysis. Additionally, p-value adjustment methods, such as Benjamini-Hochberg, Bonferroni, and q-values were applied. Tables with the results are given, as well as plots displaying the original p-values and q-values, which are available for visualization.

  7. The identification of significant metabolites will be based on the statistical thresholds set at point 4 of the usage instructions. A message will be displayed indicating the number of decreased and increased metabolites for each adjustment method, taking into account the linear model used in the Limma analysis. This will provide insights into the directionality of differential expression for the identified metabolites. [As the test model was set up as_“contrast <- paste0(group[2],”-“,group[1])”. _This instruction represent the measures difference in expression levels between Group 2 and Group 1. A positive LFC indicates higher expression in Group 2 compared to Group 1, while a negative LFC indicates higher expression in Group 1 compared to Group 2. The magnitude of the LFC represents the change in expression between the two groups.] The number of significant differences will be represented by venn-diagrams (“ggVennDiagram” package) and upset plot (“UpSetR” package).

  8. After having selected a p-value adjustment method in line 750, the significant metabolites based on the statistical thresholds will be plotted by Mean-Average plots and Volcano plots, also a dynamic volcano plot is displayed where you can see the information for each dot (metabolite). Tables with the 10 most significant metabolites for each comparative will be displayed.

  9. Finally, a sparse Partial Least Squares Discriminant Analysis (sPLS-DA) will be conducted using the “mixOmics” package. This analysis uses 2 components and 25 variables. The results for the sPLS-DA analysis include a loadings plot, individual samples plot, variable plot, and a prediction essay. Additionally, predictions results and the top 10 most important variable predictors for each comparative will be presented in tables, providing valuable insights into the predictive power of the model and the variables driving the differences between the comparatives.

To run this script, you need to provide the input and output folder addresses, as well as the statistical thresholds.


USAGE INSTRUCTIONS:

  1. The environment must be clear. If not, run this “chunk”:
rm(list = ls())
  1. Input data must proceeded from the script “Prepare_Datasets_to_Statistical_Analysis.Rmd”, because they must have the words “one_factor” in their name.

  2. You will need the following packages

packages <- c("tidyverse", "plotly", "mixOmics", "devtools", "NormalizeMets", "tools", "kableExtra",
              "qvalue", "NormalyzerDE", "limma", "FactoMineR", "factoextra", "ggVennDiagram", "UpSetR", "ggpubr",
              "ggrepel")

for (package in packages) {
  if (!requireNamespace(package, quietly = TRUE)) {
    if (package == "NormalizeMets") {
      devtools::install_github("metabolomicstats/NormalizeMets")
    } else {
      install.packages(package)
    }
  }
  library(package, character.only = TRUE)
}
  1. Set the input, output folders path here, as well as the statistical limits that will be used in the analysis
# Please indicate if your data NA values have been impute in the previous script or not. Yes/No
imputed <- "Yes"

if (imputed == "Yes") {
  input_folder <- "../1_Data/Untargeted_metabolomic_data/Imputed_ready_to_use"
  dir.create(paste0(input_folder, "/Normalyze_data"))
  output_folder <- "../1_Data/Untargeted_metabolomic_data/Imputed_ready_to_use/Normalyze_data"
  dir.create("../Images/Imputed_separate_comparatives")
  image_folder <- "../Images/Imputed_separate_comparatives"
} else {
  input_folder <- "../1_Data/Untargeted_metabolomic_data/Common_ready_to_use"
  dir.create(paste0(input_folder, "/Normalyze_data"))
  output_folder <- "../1_Data/Untargeted_metabolomic_data/Common_ready_to_use/Normalyze_data"
  dir.create("../Images/Common_metabolites")
  image_folder <- "../Images/Common_metabolites"
}

# Set Normalize method on line 183 #
# Set adjusted p-value on line 750 #

# Set the statistical parameters for the significant analysis
stats_values <- list()
LFC_limit <- "LFC_limit"; stats_values[[LFC_limit]] <- 2
alpha_value <- "alpha_value"; stats_values[[alpha_value]] <- 0.01

files <- list.files(input_folder)

files <- files[grepl("one_factor", files, ignore.case = T)]

——————————————————————————–


Previous steps to perform with my datasets

PERSONAL USE

To ensure the correc organization of the comparative A columns and accurate assignment of the comparative E groups, it is necessary to execute this specific code section for my data.

for (i in files){
  # Read data files
  data_path <- file.path(input_folder, i)
  data <- read.csv(data_path)
  group <- data [1, ] %>% .[-1] %>% as.character(.) %>% unique(.)
  
  # Perform the changes if necessary
  if (group[1] == "cA_S17" & group[2] == "cA_S25") {
    data <- data[, c(1, 5:7, 2:4)]
    # Overwrite the original file
    write.csv(data, file = data_path, row.names = FALSE)
  } else if (group[1] == "cA_S17" & group[2] == "cB_OTA_S17") {
    group <- c("group", rep("cE_noOTA", 3), rep("cE_OTA", 3))
    data <- data [-1, ] %>% rbind(group, .)
    # Overwrite the original file
    write.csv(data, file = data_path, row.names = FALSE)
  } 
}

files <- list.files(input_folder)
files <- files[grepl("one_factor", files, ignore.case = T)]

Normalization

This code section iterates over a list of files, reads each file’s data, prepares the dataset by assigning row names and extracting group information, writes the design and raw data to separate files, applies the normalyzer function for normalization, saves the normalized result, and finally removes temporary files.

for (i in files){
    # Read data files
    data_path <- file.path(input_folder, i)
    raw_data <- read.csv(data_path)
    
    # Prepare dataset
    rownames(raw_data) <- raw_data$name
    group <- raw_data[1, -1] %>% as.character()
    raw_data <- raw_data[-1, -1]

    # Design and temporal objects
    design <- data.frame(sample=colnames(raw_data), group=group)

    write.table(x = design, file = paste0(output_folder, sub("_.*.?", "", i), "_design.tsv"),
                quote = F, row.names = F, sep = "\t")
    write.table(x = raw_data, file = paste0(output_folder, sub("_.*.?", "", i), "_tmp_data.tsv"),
                quote = F, row.names = F, sep = "\t")

    # NormalyzerDE aplication
    suppressMessages(normalyzer(jobName = paste0(sub("_.*.?", "", i), "_NormalyzerDE"),
                                designPath = paste0(output_folder, sub("_.*.?", "", i), "_design.tsv"),
                                dataPath = paste0(output_folder, sub("_.*.?", "", i), "_tmp_data.tsv"),
                                outputDir = output_folder))
    cat(sprintf("File %s has been Normalyzed and the result has been saved in %s.\n", i,
                paste0(output_folder, "/", sub("_.*.?", "", i), "_NormalyzerDE\n")))

    
    unlink(paste0(output_folder, sub("_.*.?", "", i), "_design.tsv"), recursive = T)
    unlink(paste0(output_folder, sub("_.*.?", "", i), "_tmp_data.tsv"), recursive = T)
}

For the example data that was used to optimize this script and develop the corresponding Final Master Project, you can refer to the normalization summaries in the following files.

Normalized summary of Comparative A

Normalized summary of Comparative B

Normalized summary of Comparative C

Normalized summary of Comparative D

Normalized summary of Comparative E

# You can choose between the followings data normalized files: CycLoess-normalized.txt, GI-normalized.txt, log2-normalized.txt, mean-normalized.txt, median-normalized.txt, Quantile-normalized.txt, RLR-normalized.txt, VSN-normalized.txt. 
# If you do not want to use any normalized file set "None" 

norm_selected <- "/Quantile-normalized.txt"

if (norm_selected != "None") {
  
  for (i in files){
    # Set the new name for the normalized data that will be analysed
    name <- paste0(sub("_.*.?", "", i), "_normalized") 
    data <- read.table(paste0(output_folder, "/", paste0(sub("_.*.?", "", i), "_NormalyzerDE"),
                              norm_selected), header = T)
    
    # Rename rows according to their original names in the not normalized data. It is posible because the order is mainteined during the normalization process but 
    data2 <- read.csv(paste0(input_folder, "/", i))
    rownames(data2) <- data2$name
    group <- data2[1, -1] %>% rownames_to_column(); data2 <- data2[-1, -1]
    row.names(data) <- row.names(data2)
    
    # Reestructure the dataset by rearranging the columns
    data <- rownames_to_column(data, var = "name")
    data <- rbind(as.character(group), data)
    data[is.na(data)] <- 0
    
    assign(name, data)            
    write.csv(data, file = paste0(output_folder, "/", paste0(sub("_.*.?", "", i),
                                                             "_normalized.csv")), row.names = FALSE)
    
    cat(sprintf("File %s normalized by %s method has been saved in %s.\n", i,
                sub(".*/(.*?)\\-.*", "\\1", norm_selected),
                paste0(output_folder, "/", paste0(sub("_.*.?", "", i), "_normalized.csv \n"))))
  }
  
  files2 <- ls()
  files2 <- files2[grepl("normalized", files2, ignore.case = T)]
  
  for (i in files){
    # Set the new name for the normalized data that will be analysed
    name <- paste0(sub("_.*.?", "", i), "_no_normalized")
    data_path <- file.path(input_folder, i)
    data <- read.csv(data_path)
    assign(name, data) 
    
    files <- ls()
    files <- files[grepl("no_normalized", files, ignore.case = T)]
  }
  
} else {
  
  for (i in files){
    # Set the new name for the normalized data that will be analysed
    name <- paste0(sub("_.*.?", "", i), "_no_normalized") 
    data_path <- file.path(input_folder, i)
    data <- read.csv(data_path)
    assign(name, data) 
    
    files <- ls()
    files <- files[grepl("no_normalized", files, ignore.case = T)]
  }

}
## File compA_merged_One_Factor.csv normalized by Quantile method has been saved in ../1_Data/Untargeted_metabolomic_data/Imputed_ready_to_use/Normalyze_data/compA_normalized.csv 
## .
## File compB_merged_One_Factor.csv normalized by Quantile method has been saved in ../1_Data/Untargeted_metabolomic_data/Imputed_ready_to_use/Normalyze_data/compB_normalized.csv 
## .
## File compC_merged_One_Factor.csv normalized by Quantile method has been saved in ../1_Data/Untargeted_metabolomic_data/Imputed_ready_to_use/Normalyze_data/compC_normalized.csv 
## .
## File compD_merged_One_Factor.csv normalized by Quantile method has been saved in ../1_Data/Untargeted_metabolomic_data/Imputed_ready_to_use/Normalyze_data/compD_normalized.csv 
## .
## File compE_merged_One_Factor.csv normalized by Quantile method has been saved in ../1_Data/Untargeted_metabolomic_data/Imputed_ready_to_use/Normalyze_data/compE_normalized.csv 
## .

Data visualization

Boxplot

A box plot is a visual summary of a dataset’s distribution. It helps to identify the center, spread, and skewness of the data. To assess data normality using a box plot, look for symmetry, absence of outliers, balanced box length, symmetric whisker length, and a median close to the box center.

This boxplot has been made thanks to “normalizeMets” package, which use ggplot2 graphs. In raw data case, it was log transformed to set similar axis as the ones of normalized data.

plots1 <- list()
plots2 <- list()

if (norm_selected != "None") {
  for (i in files){
      data <- get(i)
      
      # Prepare dataset without normalization
      group <- data[1, ] %>% `row.names<-`(.[, 1]) %>% .[, -1] %>% t()
      met_names <- data$name %>% .[-1]
      data <- as.data.frame(lapply(data, as.numeric)) %>% .[-1, -1] %>% LogTransform(. ,zerotona=TRUE)
      data <- data$featuredata %>% t() %>% `colnames<-`(met_names)
  
      # Original data visualization (log transformed and numeric format)
      plot <- RlaPlots(data, group, cex.axis = 0.6, saveplot = F, 
                       plotname = paste0("BoxPlot of ", sub("_.*.?", "", i), " no normalized"),
                       savetype = "png", interactiveplot = T)
      index <- length(plots1) + 1
      plots1[[index]] <- plot
  }
  
  for (i in files2){
    data <- get(i)
    
    # Prepare dataset normalized
    group <- data[1, ] %>% `row.names<-`(.[, 1]) %>% .[, -1] %>% t()
    met_names <- data$name %>% .[-1]
    data <- as.data.frame(lapply(data, as.numeric)) %>% .[-1, -1]
    data <- data %>% t() %>% `colnames<-`(met_names)

    # Normalized data visualization (log transformed and numeric format)
    plot <- RlaPlots(data, group, cex.axis = 0.6, saveplot = F, 
                       plotname = paste0("BoxPlot of ", sub("_.*.?", "", i),
                                         " normalized by ", sub(".*/(.*?)\\-.*", "\\1", norm_selected), " method"),
                       savetype = "png", interactiveplot = T)
    index <- length(plots2) + 1
    plots2[[index]] <- plot
  }
  
} else {
  
  for (i in files){
      data <- get(i)
      
      # Prepare dataset without normalization
      group <- data[1, ] %>% `row.names<-`(.[, 1]) %>% .[, -1] %>% t()
      met_names <- data$name %>% .[-1]
      data <- as.data.frame(lapply(data, as.numeric)) %>% .[-1, -1] %>% LogTransform(. ,zerotona=TRUE)
      data <- data$featuredata %>% t() %>% `colnames<-`(met_names)
  
      # Original data visualization (log transformed and numeric format)
      plot <- RlaPlots(data, group, cex.axis = 0.6, saveplot = F, 
                       plotname = paste0("BoxPlot of ", sub("_.*.?", "", i), " no normalized"),
                       savetype = "png", interactiveplot = T)
      index <- length(plots2) + 1
      plots2[[index]] <- plot
  }
}

Exploratory data analysis

Principal Component Analysis (PCA)

Principal Component Analysis (PCA) is a statistical technique that seeks to reduce the dimensionality of a dataset by maximizing the variance. It achieves this by transforming the original variables into a smaller set of uncorrelated variables called principal components. The goal is to retain as much relevant information as possible while minimizing the loss of information. PCA helps simplifying complex datasets, visualize data, and identify the most important patterns or features contributing to the overall variance.

if (norm_selected == "None") {
  for (i in files) {
    # Dataset set-up
    data <- get(i) %>% `row.names<-`(., .[["name"]]) %>% .[-1]
    group <- as.character(data[1 ,])
    data <- as.data.frame(lapply(data, as.numeric)) %>% .[-1, ] %>% t()
    # PCA performance
    tmp_pca <- PCA(data, graph = FALSE, scale.unit = TRUE, quali.sup = 1)
    ## Boxplot
    tmp_eig <- tmp_pca$eig %>% .[1:5, ]
    barplot(tmp_eig[, 2], names.arg=1:nrow(tmp_eig),
            main = paste("PCA of", sub("_.*.?", "", i), "no normalized"),
            xlab = "Principal Components",
            ylab = "Percentage of variances",
            col ="gold")
    ## Graph of individuals
    plot <- fviz_pca_ind(tmp_pca, col.ind = group, 
               pointsize=1, pointshape=21,fill="black",
               repel = T, 
               addEllipses = T, ellipse.type = "confidence",
               legend.title ="Conditions",
               show_legend = TRUE, show_guide = TRUE)
    labels <- rownames(data)
    plot <- plot + 
    ggtitle(paste("PC1 & PC2 of", sub("_.*.?", "", i), "no normalized")) +
    theme(plot.title = element_text(hjust = 0.5, color="black", size=14, face="bold"))
    ggsave(filename = paste0(image_folder, "/PCA_plot_of_", sub("_.*.?", "", i), "_no_normalized.png"),
         plot = plot, width = 8, height = 6, dpi = 300)
    print(plot)
  }
} else {
  for (i in files2) {
    # Dataset set-up
    data <- get(i) %>% `row.names<-`(., .[["name"]]) %>% .[-1]
    group <- as.character(data[1 ,]); names <- colnames(data)
    data <- as.data.frame(lapply(data, as.numeric)) %>% .[-1, ] %>% t()
    # PCA performance
    tmp_pca <- PCA(data, graph = FALSE, scale.unit = TRUE, quali.sup = 1)
    ## Boxplot
    tmp_eig <- tmp_pca$eig %>% .[1:5, ]
    barplot(tmp_eig[, 2], names.arg=1:nrow(tmp_eig),
            main = paste("PCA of", sub("_.*.?", "", i),
                                        "normalized by", sub(".*/(.*?)\\-.*", "\\1", norm_selected), "method"),
            xlab = "Principal Components",
            ylab = "Percentage of variances",
            col ="gold")
    ## Graph of individuals
    plot <- fviz_pca_ind(tmp_pca, col.ind = group, 
               pointsize=1, pointshape=21,fill="black",
               repel = T, 
               addEllipses = T, ellipse.type = "confidence",
               legend.title ="Conditions",
               show_legend = TRUE, show_guide = TRUE)
    labels <- rownames(data)
    plot <- plot +
      ggtitle(paste("PC1 & PC2 of", sub("_.*.?", "", i),
                                        "normalized by", sub(".*/(.*?)\\-.*", "\\1", norm_selected), "method")) +
      theme(plot.title = element_text(hjust = 0.5, color="black", size=14, face="bold"))
    ggsave(filename = paste0(image_folder, "/PCA_plot_of_", sub("_.*.?", "", i), ".png"),
         plot = plot, width = 8, height = 6, dpi = 300)
    print(plot)
  }
}

Hierarchical Clustering on Principal Components (HCPC) analysis

HCPC (Hierarchical Clustering on Principal Components) is a statistical method that combines hierarchical clustering and principal component analysis (PCA). It is used for exploratory analysis and clustering of multivariate data.

HCPC first performs PCA on the dataset to obtain principal components that capture the variability in the data. Then, it applies hierarchical clustering on these principal components to group similar observations together based on their distance or similarity measures.

HCPC is particularly useful when dealing with high-dimensional datasets, as it allows for a comprehensive exploration of the data by revealing the underlying clusters or patterns. It can assist in identifying subgroups within a dataset and gaining a better understanding of the relationships between variables.

if (norm_selected == "None") {
  for (i in files) {
    # Dataset set-up
    data <- get(i) %>% `row.names<-`(., .[["name"]]) %>% .[-1]
    group <- as.character(data[1 ,])
    data <- as.data.frame(lapply(data, as.numeric)) %>% .[-1, ] %>% t()
    # HCA performance
    res.pca <- PCA(data, graph = FALSE,scale.unit = TRUE,quali.sup = 1)
    res.hcpc <- HCPC(res.pca, graph=FALSE,nb.clust = 2)  
    plot <- fviz_dend(res.hcpc, k = NULL, cex = 0.7, palette = "lancet", 
                      label_cols = "black", rect = T, rect_fill = T, rect_border = "lancet",
                      type="rectangle", labels_track_height = 1000, horiz = T,repel = T) + 
      ggtitle(paste("Cluster Dendrogram of", sub("_.*.?", "", i), "no normalized")) +
      theme(plot.title = element_text(hjust = 0.5, color="black", size=14, face="bold"))
    ggsave(filename = paste0(image_folder, "/HCPC_plot_of", sub("_.*.?", "", i), "_no_normalized.png"),
         plot = plot, width = 8, height = 6, dpi = 300)
    print(plot)
  }
} else {
  for (i in files2) {
    # Dataset set-up
    data <- get(i) %>% `row.names<-`(., .[["name"]]) %>% .[-1]
    group <- as.character(data[1 ,]); names <- colnames(data)
    data <- as.data.frame(lapply(data, as.numeric)) %>% .[-1, ] %>% t()
    # HCA performance
    res.pca <- PCA(data, graph = FALSE,scale.unit = TRUE,quali.sup = 1)
    res.hcpc <- HCPC(res.pca, graph=FALSE,nb.clust = 2)  
    plot <- fviz_dend(res.hcpc, k = NULL, cex = 0.7, palette = "lancet", 
                      label_cols = "black", rect = T, rect_fill = T, rect_border = "lancet",
                      type="rectangle", labels_track_height = 1000, horiz = T, repel = T) + 
      ggtitle(paste("Cluster Dendrogram of", sub("_.*.?", "", i),
                                        "normalized by", sub(".*/(.*?)\\-.*", "\\1", norm_selected), "method")) +
      theme(plot.title = element_text(hjust = 0.5, color="black", size=14, face="bold"))
    ggsave(filename = paste0(image_folder, "/HCPC_plot_of_", sub("_.*.?", "", i), ".png"),
         plot = plot, width = 8, height = 6, dpi = 300)
    print(plot)
  }
}

Clear environment

——————————————————————————–


Stats

p-value

The p-value is a measure used in statistical hypothesis testing to determine the significance of a comparison. When the p-value is smaller than a significance level “α”, it indicates that there is sufficient evidence to reject the null hypothesis (H0). Most usually “α” value use to be 0.05 (5%) or 0.01 (1%).

In this research, where “α” = 0.05, the null hypothesis suggests that there are no significant differences between the amounts of the metabolite in each condition. By rejecting the H0 and accepting the alternative hypothesis (H1), we conclude that there are statistically significant differences in the amounts of the metabolite between the conditions being compared.

* H0 —> there is no difference between the sample groups for the amounts of the metabolites. * H1 —> There is a difference between the sample groups for the amounts of the metabolites.

  • If p-value < 0.05 we refuse the H0 and accept the H1. There are statistically significant differences.
  • If p-value > 0.05 we no refuse the H0. There are no statistically significant differences.

LFC

This indicates a significant difference in the metabolite’s amount between two different conditions. By fixing an LFC value of 1 or -1, we will consider as statistically significant only those metabolites that have a difference in amount of at least two-fold or greater in either direction.

If we fix an LFC value of 2 or -2, we will be more restrictive and consider only those metabolites that have a difference in amount of at least four-fold or greater in either direction. This approach will capture larger and more pronounced changes in metabolite levels.

files2 <- vector()

for (i in files){
  # Dataset set-up
  data <- get(i) %>% `row.names<-`(., .[["name"]]) %>% .[-1]
  met_names <- row.names(data) %>% .[-1]
  group_rep <- as.character(data[1 ,]) %>% table(.)
  group <- as.character(data[1 ,]) %>% unique(.)  
  data <- as.data.frame(lapply(data, as.numeric)) %>% .[-1, ] %>% `row.names<-` (., met_names)
  
  # Prepare limma assay
  experimental.design <- model.matrix(~ -1+factor(c(rep(1, group_rep[[1]]),rep(2, group_rep[[2]])))) %>%
    `colnames<-`(., group)
  
  # Limma assay (condition 2 vs. condition 1)
  linear.fit <- lmFit(data, experimental.design)
  contrast <- paste0(group[2],"-",group[1])
  contrast.matrix <- makeContrasts(contrast = contrast, levels = group)
  contrast.linear.fit <- contrasts.fit(linear.fit, contrast.matrix)
  contrast.results <- eBayes(contrast.linear.fit)
  
  # New dataframe with stats
  data_stats <- topTable(contrast.results, number = nrow(data), coef = 1)
  new_name <- paste0(sub("_.*.?", "", i), "_stats")
  assign(new_name, data_stats)
  
  files2 <- append(files2, new_name)
  
  output <- paste(rownames(data_stats[1, ]), "has the following statistics",
                  paste(colnames(data_stats), data_stats[1, ], sep = ": ", collapse = ", "))
  cat(sprintf("The file %s has been processed using the Limma package, comparing condition %s against condition %s, and the statistical information obtained has been saved in the file %s. Here is an example of the information it contains: metabolite %s\n", i, group[2], group[1], new_name, output))
  cat("\n")
}
## The file compA_normalized has been processed using the Limma package, comparing condition cA_S17 against condition cA_S25, and the statistical information obtained has been saved in the file compA_stats. Here is an example of the information it contains: metabolite Glycerol has the following statistics logFC: -15.2611285814967, AveExpr: 19.8590589478011, t: -339.421830033795, P.Value: 2.44525606076706e-12, adj.P.Val: 5.25227343151613e-10, B: 14.2584863306686
## 
## The file compB_normalized has been processed using the Limma package, comparing condition cB_OTA_S17 against condition cB_noOTA_S25, and the statistical information obtained has been saved in the file compB_stats. Here is an example of the information it contains: metabolite Nipam has the following statistics logFC: 14.0070448715851, AveExpr: 18.6192186042277, t: 240.266295011719, P.Value: 5.94443525340435e-12, adj.P.Val: 2.43735565542852e-09, B: 13.9197684208487
## 
## The file compC_normalized has been processed using the Limma package, comparing condition cC_OTA against condition cC_noOTA, and the statistical information obtained has been saved in the file compC_stats. Here is an example of the information it contains: metabolite mz_112.0279_&_RT_17.601 has the following statistics logFC: 16.5613749241561, AveExpr: 17.3794999971448, t: 138.496503987177, P.Value: 2.56289368718984e-20, adj.P.Val: 2.76648967319951e-17, B: 26.8836921918308
## 
## The file compD_normalized has been processed using the Limma package, comparing condition cD_OTA against condition cD_noOTA, and the statistical information obtained has been saved in the file compD_stats. Here is an example of the information it contains: metabolite (R)-Lactaldehyde has the following statistics logFC: 19.5217102509373, AveExpr: 18.3019880186763, t: 262.057173035409, P.Value: 3.36240317517438e-33, adj.P.Val: 5.24534895327203e-30, B: 44.4870235270816
## 
## The file compE_normalized has been processed using the Limma package, comparing condition cE_OTA against condition cE_noOTA, and the statistical information obtained has been saved in the file compE_stats. Here is an example of the information it contains: metabolite mz_104.10667_&_RT_0.917 has the following statistics logFC: -16.2860134121705, AveExpr: 19.7587028745205, t: -475.527014405671, P.Value: 7.27252016105157e-13, adj.P.Val: 3.35233856646671e-10, B: 15.090635266556
cat(sprintf("Vector files2 has the following information %s.\n", paste(files2, collapse = ", ")))
## Vector files2 has the following information compA_stats, compB_stats, compC_stats, compD_stats, compE_stats.

p-value adjustment

Multiple comparisons refer to the situation where several pairwise comparisons are made within a dataset or experiment. When conducting multiple comparisons, the probability of obtaining false positive results (Type I errors) increases. To address this issue, various statistical methods can be used to adjust the significance level (p-value) or control the overall error rate.

Methods to control the Family Wise Error (FWER):

It is a statistical concept that refers to the probability of making at least one Type I error (rejecting a true null hypothesis) among a set of multiple hypothesis tests. The FWER is a measure that quantifies the overall error rate when multiple hypotheses are tested simultaneously.

p-value adjustment by Bonferroni:

It involves dividing the significance level (p-value) by the number of comparisons performed. For example, if there are 10 comparisons being made and a desired overall significance level of 5%, the p-value threshold would be adjusted to 0.005 (0.05 divided by 10). If the p-value obtained in a comparison is lower than this adjusted threshold, it is considered statistically significant. This method adjusts the significance level for each individual hypothesis to maintain the desired FWER.

Methods to control the False Discovery Rate (FDR):

The “BH” and “BY” methods of Benjamini, Hochberg, and Yekutieli control the false discovery rate (FDR), the expected proportion of false discoveries amongst the rejected hypotheses. The false discovery rate is a less stringent condition than the family-wise error rate, so these methods are more powerful than the others.

p-value adjustment by Benjamini-Hochberg (BH):

The Benjamini-Hochberg method is another technique used to control the type I error in multiple comparisons. Unlike Bonferroni, B-H is less conservative and can be more powerful in detecting statistically significant differences. Instead of adjusting each individual p-value, B-H controls the expected proportion of false discoveries among the rejected hypotheses. This is achieved by ordering the p-values from lowest to highest, calculating a critical value based on the desired FDR rate, and comparing it to each p-value in order. P-values below the critical value are considered statistically significant.

q- value:

Just as the p-value gives the expected false positive rate obtained by rejecting the null hypothesis for any result with an equal or smaller p-value, the q-value gives the expected pFDR obtained by rejecting the null hypothesis for any result with an equal or smaller q-value.

P-VALUE ADJUSTMENT

for (i in files2) {
  message(paste("Processing file:", i))
  data <- get(i)
  data$p_value_bonferroni <- p.adjust(data$P.Value, method = "bonferroni", n = length(data$P.Value))
  names(data)[names(data) == "adj.P.Val"] <- "p_value_BH"
  new_name <- paste0("q_value_", sub("_.*.?", "", i))
  tmp <- data$P.Value
  if (i == "compC_stats"){
    message(paste("Error in qvalue calculation for file:", i))
    tmp_q_value <- qvalue(tmp, pi0 = 1)
    message(paste("File", i, "processed successfully with pi0 = 1"))
    } else {
      tmp_q_value <- qvalue(tmp)
      message(paste("File", i, "processed successfully"))
    }
  data$q_value <- tmp_q_value$qvalues
  assign(new_name, tmp_q_value)
  assign(i, data)
}
## Processing file: compA_stats
## File compA_stats processed successfully
## Processing file: compB_stats
## File compB_stats processed successfully
## Processing file: compC_stats
## Error in qvalue calculation for file: compC_stats
## File compC_stats processed successfully with pi0 = 1
## Processing file: compD_stats
## File compD_stats processed successfully
## Processing file: compE_stats
## File compE_stats processed successfully
Comparative A statistics of cA_S17 Vs. cA_S25
logFC AveExpr t P.Value p_value_BH B p_value_bonferroni q_value
Glycerol -15.26113 19.85906 -339.4218 2.4e-12 5.3e-10 14.25849 1.4e-09 9.4e-11
mz_104.10667_&_RT_0.917 15.26113 19.85906 339.4218 2.4e-12 5.3e-10 14.25849 1.4e-09 9.4e-11
Choline -14.92010 19.68854 -331.8369 2.7e-12 5.3e-10 14.24841 1.6e-09 9.4e-11
mz_113.0342_&_RT_1.044 12.97611 18.71655 267.5657 8.3e-12 1.2e-09 14.12837 4.7e-09 2.1e-10
2-Hydrazino-4-methoxy-6-(4-morpholinyl)-1,3,5-triazine 11.18270 17.81985 239.0582 1.5e-11 1.5e-09 14.04450 8.4e-09 2.7e-10
(R)-Coclaurine -13.37509 18.91604 -235.1382 1.6e-11 1.5e-09 14.03074 9.2e-09 2.7e-10
Comparative B statistics of cB_OTA_S17 Vs. cB_noOTA_S25
logFC AveExpr t P.Value p_value_BH B p_value_bonferroni q_value
Nipam 14.00704 18.61922 240.2663 5.9e-12 2.4e-09 13.91977 4.4e-09 4.8e-10
MFCD00053868 13.77940 18.50540 225.6247 8.3e-12 2.4e-09 13.87836 6.1e-09 4.8e-10
mz_124.03888_&_RT_0.894 13.95809 18.59474 218.3975 9.9e-12 2.4e-09 13.85503 7.3e-09 4.8e-10
mz_112.89548_&_RT_0.793 -11.59652 17.41396 -189.3841 2.1e-11 3.7e-09 13.73566 1.6e-08 7.3e-10
[FAamino(5:0)]2S-amino-pentanoicacid 11.61473 17.42306 170.9660 3.6e-11 3.7e-09 13.63021 2.7e-08 7.3e-10
3-Carboxy-1-methylpyridinium;N-Methylnicotinicacid 9.80792 16.51966 166.3762 4.2e-11 3.7e-09 13.59901 3.1e-08 7.3e-10
Comparative C statistics of cC_OTA Vs. cC_noOTA
logFC AveExpr t P.Value p_value_BH B p_value_bonferroni q_value
mz_112.0279_&_RT_17.601 16.56137 17.37950 138.4965 2.6e-20 2.8e-17 26.88369 3.2e-17 2.8e-17
mz_113.03416_&_RT_1.008 -16.68298 17.44030 -127.6254 6.7e-20 2.8e-17 26.73935 8.3e-17 2.8e-17
MFCD00053868 16.07279 17.13521 127.5828 6.7e-20 2.8e-17 26.73871 8.3e-17 2.8e-17
1_8-Diazacyclotetradecane-2_9-dione 12.96894 15.58329 123.3112 1.0e-19 3.1e-17 26.67233 1.2e-16 3.1e-17
Nicotinicacid -15.87450 17.03606 -120.9224 1.3e-19 3.1e-17 26.63244 1.6e-16 3.1e-17
mz_185.10164_&_RT_0.896 -13.72447 15.96105 -113.1935 2.8e-19 5.1e-17 26.48774 3.4e-16 5.1e-17
Comparative D statistics of cD_OTA Vs. cD_noOTA
logFC AveExpr t P.Value p_value_BH B p_value_bonferroni q_value
(R)-Lactaldehyde 19.52171 18.30199 262.0572 3.4e-33 5.2e-30 44.48702 5.2e-30 9.9e-31
mz_57.03374_&_RT_0.876 19.02535 18.05381 239.3348 1.7e-32 8.7e-30 44.36480 2.6e-29 1.6e-30
mz_57.03375_&_RT_0.96 -19.02535 18.05381 -239.3348 1.7e-32 8.7e-30 44.36480 2.6e-29 1.6e-30
mz_112.02786_&_RT_17.614 -17.87461 17.47844 -210.1631 1.7e-31 6.5e-29 44.15001 2.6e-28 1.2e-29
C2H3N4P -15.80142 16.44184 -180.2693 2.5e-30 7.8e-28 43.82275 3.9e-27 1.5e-28
mz_112.02788_&_RT_17.698 17.50094 17.29160 164.7777 1.2e-29 3.2e-27 43.58623 1.9e-26 6.0e-28
Comparative E statistics of cE_OTA Vs. cE_noOTA
logFC AveExpr t P.Value p_value_BH B p_value_bonferroni q_value
mz_104.10667_&_RT_0.917 -16.28601 19.75870 -475.5270 7.3e-13 3.4e-10 15.09064 5.5e-10 1.2e-11
Xanthurenic acid -15.67474 19.45307 -457.6787 8.8e-13 3.4e-10 15.07688 6.7e-10 1.2e-11
Cholinesulfate 13.76456 20.40815 401.9045 1.7e-12 4.3e-10 15.02208 1.3e-09 1.5e-11
mz_113.0342_&_RT_1.044 -13.63820 18.43480 -338.8689 4.0e-12 7.6e-10 14.92713 3.0e-09 2.7e-11
MFCD00053868 11.68451 19.36812 302.9741 7.0e-12 1.0e-09 14.84705 5.3e-09 3.6e-11
2-Hydrazino-4-methoxy-6-(4-morpholinyl)-1,3,5-triazine -11.83822 17.53481 -293.2725 8.2e-12 1.0e-09 14.82068 6.2e-09 3.6e-11

P-VALUE REPRESENTATION

for (i in files2){
  # Q-value plots:
  q_value_plot <- paste0("q_value_", sub("_.*.?", "", i))
  q_value_plot <- get(q_value_plot)
  plot(q_value_plot, main = paste("Q-value plots of file", i)) 
  
  # P-values histograms:
  data <- get(i)
  hist(data$P.Value, main = paste("Original p-value of file", i), 
       xlab = "p-value", col = "gold")
}

Clear environment

——————————————————————————–


Graphs and results of significant metabolites

Significant metabolites

## SIGNIFICANT METABOLITES in file compA_stats with a LFC value of 2 and a α value of 0.01. 
## 
## Total number of metabolites is 574. 
## 
## The significant metabolites in file compA_stats by original p-values obteined by limma are 351. 
##   - 162 are increased. 
##   - 189 are decreased. 
##   - The most significant metabolite increased in cA_S17 condition is mz_104.10667_&_RT_0.917. 
##   - Whereas the most significant metabolite decreased in cA_S17 condition is Glycerol.
## 
## The significant metabolites in file compA_stats by Bonferroni p-values are 303. 
##   - 149 are increased. 
##   - 154 are decreased. 
##   - The most significant metabolite increased in cA_S17 condition is mz_104.10667_&_RT_0.917. 
##   - Whereas the most significant metabolite decreased in cA_S17 condition is Glycerol.
## 
## The significant metabolites in file compA_stats by Benjamini-Hochberg p-values are 350. 
##   - 162 are increased. 
##   - 188 are decreased. 
##   - The most significant metabolite increased in cA_S17 condition is mz_104.10667_&_RT_0.917. 
##   - Whereas the most significant metabolite decreased in cA_S17 condition is Glycerol.
## 
## The significant metabolites in file compA_stats by q-value testare 355. 
##   - 163 are increased. 
##   - 192 are decreased. 
##   - The most significant metabolite increased in cA_S17 condition is mz_104.10667_&_RT_0.917. 
##   - Whereas the most significant metabolite decreased in cA_S17 condition is Glycerol.
## 
## |------------------------|
## 
## SIGNIFICANT METABOLITES in file compB_stats with a LFC value of 2 and a α value of 0.01. 
## 
## Total number of metabolites is 740. 
## 
## The significant metabolites in file compB_stats by original p-values obteined by limma are 432. 
##   - 243 are increased. 
##   - 189 are decreased. 
##   - The most significant metabolite increased in cB_OTA_S17 condition is Nipam. 
##   - Whereas the most significant metabolite decreased in cB_OTA_S17 condition is mz_112.89548_&_RT_0.793.
## 
## The significant metabolites in file compB_stats by Bonferroni p-values are 384. 
##   - 211 are increased. 
##   - 173 are decreased. 
##   - The most significant metabolite increased in cB_OTA_S17 condition is Nipam. 
##   - Whereas the most significant metabolite decreased in cB_OTA_S17 condition is mz_112.89548_&_RT_0.793.
## 
## The significant metabolites in file compB_stats by Benjamini-Hochberg p-values are 431. 
##   - 243 are increased. 
##   - 188 are decreased. 
##   - The most significant metabolite increased in cB_OTA_S17 condition is Nipam. 
##   - Whereas the most significant metabolite decreased in cB_OTA_S17 condition is mz_112.89548_&_RT_0.793.
## 
## The significant metabolites in file compB_stats by q-value testare 439. 
##   - 245 are increased. 
##   - 194 are decreased. 
##   - The most significant metabolite increased in cB_OTA_S17 condition is Nipam. 
##   - Whereas the most significant metabolite decreased in cB_OTA_S17 condition is mz_112.89548_&_RT_0.793.
## 
## |------------------------|
## 
## SIGNIFICANT METABOLITES in file compC_stats with a LFC value of 2 and a α value of 0.01. 
## 
## Total number of metabolites is 1230. 
## 
## The significant metabolites in file compC_stats by original p-values obteined by limma are 852. 
##   - 456 are increased. 
##   - 396 are decreased. 
##   - The most significant metabolite increased in cC_OTA condition is mz_112.0279_&_RT_17.601. 
##   - Whereas the most significant metabolite decreased in cC_OTA condition is mz_113.03416_&_RT_1.008.
## 
## The significant metabolites in file compC_stats by Bonferroni p-values are 729. 
##   - 382 are increased. 
##   - 347 are decreased. 
##   - The most significant metabolite increased in cC_OTA condition is mz_112.0279_&_RT_17.601. 
##   - Whereas the most significant metabolite decreased in cC_OTA condition is mz_113.03416_&_RT_1.008.
## 
## The significant metabolites in file compC_stats by Benjamini-Hochberg p-values are 851. 
##   - 455 are increased. 
##   - 396 are decreased. 
##   - The most significant metabolite increased in cC_OTA condition is mz_112.0279_&_RT_17.601. 
##   - Whereas the most significant metabolite decreased in cC_OTA condition is mz_113.03416_&_RT_1.008.
## 
## The significant metabolites in file compC_stats by q-value testare 851. 
##   - 455 are increased. 
##   - 396 are decreased. 
##   - The most significant metabolite increased in cC_OTA condition is mz_112.0279_&_RT_17.601. 
##   - Whereas the most significant metabolite decreased in cC_OTA condition is mz_113.03416_&_RT_1.008.
## 
## |------------------------|
## 
## SIGNIFICANT METABOLITES in file compD_stats with a LFC value of 2 and a α value of 0.01. 
## 
## Total number of metabolites is 1560. 
## 
## The significant metabolites in file compD_stats by original p-values obteined by limma are 920. 
##   - 439 are increased. 
##   - 481 are decreased. 
##   - The most significant metabolite increased in cD_OTA condition is (R)-Lactaldehyde. 
##   - Whereas the most significant metabolite decreased in cD_OTA condition is mz_57.03375_&_RT_0.96.
## 
## The significant metabolites in file compD_stats by Bonferroni p-values are 829. 
##   - 411 are increased. 
##   - 418 are decreased. 
##   - The most significant metabolite increased in cD_OTA condition is (R)-Lactaldehyde. 
##   - Whereas the most significant metabolite decreased in cD_OTA condition is mz_57.03375_&_RT_0.96.
## 
## The significant metabolites in file compD_stats by Benjamini-Hochberg p-values are 917. 
##   - 436 are increased. 
##   - 481 are decreased. 
##   - The most significant metabolite increased in cD_OTA condition is (R)-Lactaldehyde. 
##   - Whereas the most significant metabolite decreased in cD_OTA condition is mz_57.03375_&_RT_0.96.
## 
## The significant metabolites in file compD_stats by q-value testare 934. 
##   - 452 are increased. 
##   - 482 are decreased. 
##   - The most significant metabolite increased in cD_OTA condition is (R)-Lactaldehyde. 
##   - Whereas the most significant metabolite decreased in cD_OTA condition is mz_57.03375_&_RT_0.96.
## 
## |------------------------|
## 
## SIGNIFICANT METABOLITES in file compE_stats with a LFC value of 2 and a α value of 0.01. 
## 
## Total number of metabolites is 761. 
## 
## The significant metabolites in file compE_stats by original p-values obteined by limma are 439. 
##   - 210 are increased. 
##   - 229 are decreased. 
##   - The most significant metabolite increased in cE_OTA condition is Cholinesulfate. 
##   - Whereas the most significant metabolite decreased in cE_OTA condition is mz_104.10667_&_RT_0.917.
## 
## The significant metabolites in file compE_stats by Bonferroni p-values are 386. 
##   - 178 are increased. 
##   - 208 are decreased. 
##   - The most significant metabolite increased in cE_OTA condition is Cholinesulfate. 
##   - Whereas the most significant metabolite decreased in cE_OTA condition is mz_104.10667_&_RT_0.917.
## 
## The significant metabolites in file compE_stats by Benjamini-Hochberg p-values are 433. 
##   - 204 are increased. 
##   - 229 are decreased. 
##   - The most significant metabolite increased in cE_OTA condition is Cholinesulfate. 
##   - Whereas the most significant metabolite decreased in cE_OTA condition is mz_104.10667_&_RT_0.917.
## 
## The significant metabolites in file compE_stats by q-value testare 480. 
##   - 241 are increased. 
##   - 239 are decreased. 
##   - The most significant metabolite increased in cE_OTA condition is Cholinesulfate. 
##   - Whereas the most significant metabolite decreased in cE_OTA condition is mz_104.10667_&_RT_0.917.
## 
## |------------------------|

Representation of significant metabolites according to the p-value used

for (i in files2){
  data <- get(i)
  # Obtain the significant metabolites by original p-values
  significant <- subset(data, (logFC > stats_values$LFC_limit & P.Value < stats_values$alpha_value) |
                          (logFC < -stats_values$LFC_limit & P.Value < stats_values$alpha_value))
  Original_mets <- row.names(significant)
  # Obtain the significant metabolites by BH p-values
  significant <- subset(data, (logFC > stats_values$LFC_limit & p_value_BH < stats_values$alpha_value) |
                          (logFC < -stats_values$LFC_limit & p_value_BH < stats_values$alpha_value))
  BH_mets <- row.names(significant)
  # Obtain the significant metabolites by original p-values
  significant <- subset(data, (logFC > stats_values$LFC_limit & p_value_bonferroni < stats_values$alpha_value) |
                          (logFC < -stats_values$LFC_limit & p_value_bonferroni < stats_values$alpha_value))
  Bonferroni_mets <- row.names(significant)
  # Obtain the significant metabolites by q-values
  significant <- subset(data, (logFC > stats_values$LFC_limit & q_value < stats_values$alpha_value) |
                          (logFC < -stats_values$LFC_limit & q_value < stats_values$alpha_value))
  Qvalue_mets <- row.names(significant)
  
  significant <- list(Original_mets = Original_mets, BH_mets = BH_mets, Bonferroni_mets = Bonferroni_mets,
                      Qvalue_mets = Qvalue_mets)

  # Performance Venn diagram
  random_color <- sample(colors(), 1)
  plot <- ggVennDiagram(significant, label_alpha = 0.4) +
    scale_fill_gradient(low="white", high = random_color) + labs(fill = "Significant \nmetabolites") +
    ggtitle(paste("Venn Diagram of file", i,"with the significant \nmetabolites according to the p-value used")) +
    theme(plot.title = element_text(hjust = 0.5, color="black", size=14, face="bold"))
  print(plot)
  
  # Upset diagram
  plot <- upset(fromList(significant),
      number.angles = 0, point.size = 3, line.size = 1,
      sets.x.label = "Number of significant \n metabolites",
      mainbar.y.label = paste("Upset Diagram of file", i,
                              "\nwith the significant metabolites \naccording to the p-value used
                              \n\nIntersection Size"), 
      set_size.show = T, matrix.color = random_color,
      set_size.scale_max = max(sapply(significant, length))+50,
      text.scale = c(1.5, 1.5, 1, 1, 1.4, 2), order.by = "freq") 
  print(plot)
}

Mean-Average (MA) plot

# Set the p_value adjustment method selected to use in the following steps. You can choose between: p_value_BH, p_value_bonferroni, q_value and P.Value
selected_p_adjustment <- "p_value_BH"

## Create a function that bind columns and complete the missing rows with NA values
cbind.fill <- function(...){
  nm <- list(...) 
  nm<-lapply(nm, as.matrix)
  n <- max(sapply(nm, nrow)) 
  do.call(cbind, lapply(nm, function (x) {
    rbind(x, matrix(NA, n-nrow(x), ncol(x)))
  }))
}

for (i in files2) {
  # Prepare datasets
  data <- get(i)
  names(data) <- gsub("logFC", "log2FoldChange", gsub("AveExpr", "baseMean", gsub(selected_p_adjustment, "padj", names(data))))
  data <- data[c("log2FoldChange", "baseMean", "padj")]
  
  # Save significant metabolites as .csv file, filling with NA values till the end of the dataframe.
  ## Select the desired metabolites to save
  increased <- subset(data, log2FoldChange > stats_values$LFC_limit  & padj < stats_values$alpha_value) %>%
    rownames_to_column() %>% .[, 1] %>% as.data.frame()
  decreased <- subset(data, log2FoldChange < - stats_values$LFC_limit  & padj < stats_values$alpha_value) %>%
    rownames_to_column() %>% .[, 1] %>% as.data.frame()
  ## Save data
  significant_data <- cbind.fill(increased, decreased) %>% `colnames<-`(lm_contrast[sub("_.*.?", "", i)][[1]])
  write.csv(significant_data, file = paste0(input_folder, "/", paste0(sub("_.*.?", "", i),
                                                             "_significant_mets.csv")), row.names = FALSE)
  
  # Performance MA plot
  plot <- ggmaplot(
    data, fdr = stats_values$alpha_value, fc = stats_values$LFC_limit,
    genenames = row.names(data), detection_call = NULL, size = 1.5,
    alpha = 1, seed = 42, font.label = c(10, "italic", "black"),
    label.rectangle = F, palette = c("#B31B21", "#1465AC", "darkgray"), 
    top = 4, select.top.method = c("padj", "fc"), label.select = NULL,
    main = NULL, xlab = "Log2 mean expression", ylab = "Log2 fold change",
    ggtheme = theme_pubr()) + 
    ggtitle(paste0("MA plot of file ", i, " (", lm_contrast[sub("_.*.?", "", i)][[1]][1],
                  " Vs. ",lm_contrast[sub("_.*.?", "", i)][[1]][2], ")")) +
    theme(plot.title = element_text(hjust = 0.5, color="black", size=14, face="bold"))
  
  ggsave(filename = paste0(image_folder, "/MA_plot_of_", sub("_.*.?", "", i), ".png"),
         plot = plot, width = 8, height = 6, dpi = 300)
  
  print(plot)
}

Volcano plot of significant metabolites

j = 0
plots <- list()
tables1 <- list()
RDS_files <- list()

for (i in files2){
  data <- get(i) %>% rownames_to_column("Metabolites") %>% .[c("Metabolites", "logFC", selected_p_adjustment)]
  data <- data %>% 
    mutate(Significant = ifelse(abs(logFC) < stats_values$LFC_limit & p_value_BH > stats_values$alpha_value, "ns",
                         ifelse(logFC >= stats_values$LFC_limit & p_value_BH < stats_values$alpha_value, "up",
                                ifelse(logFC <= -stats_values$LFC_limit & p_value_BH < stats_values$alpha_value, "down",
                                       "ns"))))
  
  # Make a table with the 10 most significant metabolites to each condition
  mets <- bind_cols(data %>% arrange(desc(Significant)) %>% head(10) %>% .$Metabolites, 
                    data %>% arrange(Significant) %>% head(10) %>% .$Metabolites) %>% 
    `colnames<-`(c(lm_contrast[sub("_.*.?", "", i)][[1]][1], lm_contrast[sub("_.*.?", "", i)][[1]][2]))
  ## Save data set in a list with the correspondence name
  RDS_files[[sub("_.*.?", "", i)]] <- mets
  
  j <- j + 1
  
  table <- mets %>% 
    kbl(caption = paste("Table", j,": The 10 most significant metabolites to each condition of",
                        sub("_.*.?", "", i), "are: ")) %>% 
    kable_classic(full_width = F, html_font = "Cambria")
  
  index <- length(tables1) + 1
  tables1[[index]] <- table
  
  # Make a dataframe with the most significant metabolites, in order to fix them in the volcano plot
  mets <- list()
  mets <- bind_rows(data %>% arrange(desc(Significant)) %>% head(5), 
                    data %>% arrange(Significant) %>% head(5))
  
  plot <- data %>% 
    ggplot(aes(x=logFC, y=-log10(p_value_BH), color=Significant,
              text=paste0("</br>Metabolite: ", Metabolites,
                          "</br>Adj p-value: ", p_value_BH,
                         "</br>LFC: ", logFC))) +
    geom_point(alpha=0.75, shape=16) + xlim(-15,15) +
    theme_gray() + theme(legend.position = "bottom") + 
    ggtitle(paste0("Volcano plot of file ", i, " (", lm_contrast[sub("_.*.?", "", i)][[1]][1],
                  " Vs. ",lm_contrast[sub("_.*.?", "", i)][[1]][2], ")")) +
    theme(plot.title = element_text(hjust = 0.5, color="black", size=14, face="bold")) +
    labs(color = "Significant \nMetabolite") + 
    scale_color_manual(values = c("up" = "#faaa11", "down" = "#26b3ff", "ns" = "grey")) +
    geom_point(data = mets, aes(color = Significant), size = 4) +
    geom_text_repel(data = mets, aes(label = Metabolites), size = 3, color = "black")

  print(plot)
  ggsave(filename = paste0(image_folder, "/Volcano_plot_of_", sub("_.*.?", "", i), ".png"),
         plot = plot, width = 8, height = 6, dpi = 300)
  
  index <- length(plots) + 1
  plots[[index]] <- plot
}
## New names:
## New names:
## • `` -> `...1`
## • `` -> `...2`

## New names:
## • `` -> `...1`
## • `` -> `...2`

## New names:
## • `` -> `...1`
## • `` -> `...2`

## New names:
## • `` -> `...1`
## • `` -> `...2`

#if (imputed == "Yes") {
#  saveRDS(RDS_files, file = "../../Tables_TFM/significant_mets.rds")
#} else {
#  saveRDS(RDS_files, file = "../../Tables_TFM/significant_mets_common.rds")
#}

Dynamic Volcano plot of significant metabolites

Tables with the most significant metabolites for each comparative

tables1[[1]]
Table 1 : The 10 most significant metabolites to each condition of compA are:
cA_S17 cA_S25
mz_104.10667_&_RT_0.917 Glycerol
mz_113.0342_&_RT_1.044 Choline
2-Hydrazino-4-methoxy-6-(4-morpholinyl)-1,3,5-triazine (R)-Coclaurine
Acetyl-L-carnitine N-({2-[(Dimethylcarbamoyl)amino]ethyl}sulfamoyl)-N-methyl-beta-alanine
Cyclopentenone TO0110000
C10H21N2O6P mz_185.10163_&_RT_0.815
Catechol;1_2-Dihydroxybenzene (R)-Norcoclaurine
C11H17O2PS 8-(4-Ethyl-1-piperazinyl)-3-methyl-7-(2-methyl-2-propen-1-yl)-3,7-dihydro-1H-purine-2,6-dione
mz_112.89545_&_RT_0.89 mz_112.8955_&_RT_0.753
Boc-NH-PEG4-CH2CH2NH2 9-epi-sacrolide A
tables1[[2]]
Table 2 : The 10 most significant metabolites to each condition of compB are:
cB_OTA_S17 cB_noOTA_S25
Nipam mz_112.89548_&_RT_0.793
MFCD00053868 Nicotinicacid
mz_124.03888_&_RT_0.894 (1S,2S,3R)-4-hydroxymethylcyclopent-4-ene-1,2,3-triol
[FAamino(5:0)]2S-amino-pentanoicacid 1_8-Diazacyclotetradecane-2_9-dione
3-Carboxy-1-methylpyridinium;N-Methylnicotinicacid Metanephrine
Styrene (12Z)-9,10,11-trihydroxyoctadec-12-enoic acid
Purine chavicol
mz_185.10162_&_RT_0.791 Pyrindan
C17H39P3 (1S,2R,8aS)-1,2-Dihydroxyindolizidine
2,6-di-tert-butyl-4-ethylphenol S-Isopropyl thiosenecioate
tables1[[3]]
Table 3 : The 10 most significant metabolites to each condition of compC are:
cC_OTA cC_noOTA
mz_112.0279_&_RT_17.601 mz_113.03416_&_RT_1.008
MFCD00053868 Nicotinicacid
1_8-Diazacyclotetradecane-2_9-dione mz_185.10164_&_RT_0.896
3,5-BIS(TRIFLUOROMETHYL)-1-PHENYLPYRAZOLE mz_112.89547_&_RT_0.836
mz_113.03411_&_RT_0.916 1-Chloro-6-fluorohexane
mz_185.10162_&_RT_0.799 Styrene
mz_112.89545_&_RT_0.738 D-Proline
mz_124.03892_&_RT_0.916 Naphthalene
mz_236.22152_&_RT_7.605 S-Isopropyl thiosenecioate
mz_190.00047_&_RT_17.397 C5H13N5OS2
tables1[[4]]
Table 4 : The 10 most significant metabolites to each condition of compD are:
cD_OTA cD_noOTA
(R)-Lactaldehyde mz_57.03375_&_RT_0.96
mz_57.03374_&_RT_0.876 mz_112.02786_&_RT_17.614
mz_112.02788_&_RT_17.698 C2H3N4P
C7H8N7O4P mz_169.98957_&_RT_17.676
mz_185.10164_&_RT_0.875 mz_113.03417_&_RT_0.923
1_8-Diazacyclotetradecane-2_9-dione C8H14N10S2
mz_104.50919_&_RT_17.472 mz_236.22148_&_RT_7.628
mz_190.0005_&_RT_17.471 C3H11N8PS
3,5-BIS(TRIFLUOROMETHYL)-1-PHENYLPYRAZOLE C10H4N8O2
4-Chloro-N-cyclopropyl-N-(4-piperidinyl)benzenesulfonamide C7H22N9O6P
tables1[[5]]
Table 5 : The 10 most significant metabolites to each condition of compE are:
cE_OTA cE_noOTA
Cholinesulfate mz_104.10667_&_RT_0.917
MFCD00053868 Xanthurenic acid
N-Methylpyrrolidone mz_113.0342_&_RT_1.044
Nipam 2-Hydrazino-4-methoxy-6-(4-morpholinyl)-1,3,5-triazine
3-Carboxy-1-methylpyridinium;N-Methylnicotinicacid Cyclopentenone
3-[(5-Amino-6-chloro-4-pyrimidinyl)amino]-2-nonanol α-Eleostearic acid
Stachydrine Hypoxanthine
[FAamino(5:0)]2S-amino-pentanoicacid Boc-NH-PEG4-CH2CH2NH2
Purine S-Isopropyl thiosenecioate
5-hydroxynorvaline C11H17O2PS

Clear environment

——————————————————————————–


Multivariant analysis

sPLS-DA

PLS-DA is a statistical method used for classification. It helps to predict or classify samples into different groups based on a set of variables. It is particularly useful for high-dimensional data like metabolomics. By applying PLS-DA, you can identify the important variables that distinguish between different groups, providing valuable insights for further analysis.

In PLS-DA, the method combines the principles of Partial Least Squares (PLS) regression and Discriminant Analysis. It aims to find a linear combination of the predictor variables that maximally explains the variation in the response variable while also maximizing the separation between different groups or categories.

In this case, given the large number of variables available, we decided to perform a sparse PLS-DA analysis. We limited the analysis to 2 components based on the good separation achieved in the PCA assay. Additionally, we reduced the number of variables to 25.

The prediction capacity of the “pls” model will be analysed by using the “predict” function from mixOmics package. The algorithim used to identify the similarities between the model and the new data is the Mahalanobis distance.

tables1 <- list()
tables2 <- list()
RDS_files <- list()
RDS_files1 <- list()

for (i in files) {
  j <- j + 1 
  data <- get(i)
  Y <- as.character(data[1 ,]) %>% .[-1] %>% as.factor(.)
  mets <- (data[, 1]) %>% .[-1] %>% as.data.frame() %>% `colnames<-`(., "Mets_names") %>% 
    mutate(ID = paste0("Metabolite_", 1:(nrow(data)-1)))
  write.csv(mets, file = paste0(input_folder, "/", paste0(sub("_.*.?", "", i), "_mets_names.csv")), row.names = FALSE)
  data <- data[-1, -1] %>% mutate_all(as.numeric) %>% t() %>% `colnames<-`(., mets$ID)
  
  
  # Prepare data for the sPLS-DA 
  ## Remove variables with zero standard deviation
  sd_values <- apply(data, 2, sd)
  filtered_data <- data[, sd_values > 0] %>% as.data.frame(); X <- filtered_data
  
  # Perform sPLS-DA
  tmp_pls <- mixOmics::splsda(X, Y, ncomp = 2, 
                                 keepX = 25, 
                                 scale = FALSE)

  
  # Prediction capacity
  ## Obtain the mean sd value calculated per sample in the dataset
  mean_sd <- mean(apply(X, 1, sd))
  ## Randomly select 5 observations from the X matrix
  set.seed(123)
  s <- sample(1:nrow(X), 5)
  Xnew = X[s, , drop = FALSE]
  ## Modify original samples
  Xnew = t(apply(Xnew, 1, function(x){x + rnorm(ncol(X), 0, mean_sd/10)}))
  ## Perform de prediction
  myprediction = predict(tmp_pls, Xnew, dist = "mahalanobis.dist")
  ## Save data set in a list with the correspondence name
  RDS_files[[sub("_.*.?", "", i)]] <- myprediction$class
  
  table <- myprediction$class %>% 
    kbl(caption = paste("Table", j,": The predicted classes for the randomly and edited samples of",
                        sub("_.*.?", "", i), "are: ")) %>% 
    kable_classic(full_width = F, html_font = "Cambria")
  
  index <- length(tables1) + 1
  tables1[[index]] <- table
  
  
  # The 10 most important predictor variables selection to each group
  ## Those with positive weight in component 1 -> group_1
  group_1 <- tmp_pls$loadings$X %>% as.data.frame() %>% arrange(desc(comp1)) %>% 
    head(10) %>% row.names() %>% {mets$Mets_names[match(., mets$ID)]}
  ## Those with negative weight in component 1 -> group_2
  group_2 <- tmp_pls$loadings$X %>% as.data.frame() %>% arrange(comp1) %>% 
    head(10) %>% row.names() %>% {mets$Mets_names[match(., mets$ID)]}
  
  ## If the sign of the weight for group 1 in the first component is negative, it means that the metabolites associated with group 1 by the sPLS-DA technique will have a negative weight. On the other hand, if the sign of the weight for group 1 is positive, it indicates that the metabolites associated with group 1 by the sPLS-DA technique will have a positive weight.
  if (sign(tmp_pls$loadings$Y[1,1]) < 0){
    group <- tmp_pls$loadings$Y %>% row.names()
  } else {
    group <- tmp_pls$loadings$Y %>% row.names() %>% rev()
  }
  VIP_metabolites <- cbind(group_2, group_1) %>% `colnames<-`(., group)
  
  table <- VIP_metabolites %>% 
    kbl(caption = paste("Table", j+5,": The 10 most important predictor variables of",
                        sub("_.*.?", "", i), "are: ")) %>% 
    kable_classic(full_width = F, html_font = "Cambria")

  index <- length(tables2) + 1
  tables2[[index]] <- table
  
  
  # Save PLS variables loadings in RDS files 
  ## Change ID by their metabolite name
  pls_loadings <- tmp_pls$loadings$X %>% .[, 1] %>% as.data.frame() 
  new_row_names <- rownames(pls_loadings) %>% {mets$Mets_names[match(., mets$ID)]}
  rownames(pls_loadings) <- new_row_names 
  ## Split the data frame in a positive and negative one. Without 0 weight
  positive_weight <- subset(pls_loadings, . > 0) %>% arrange(.) %>% `colnames<-`(., group[2])
  negative_weight <- subset(pls_loadings, . < 0) %>% arrange(.) %>% `colnames<-`(., group[1])
  ## Save both data set in a list with the correspondence name
  list_1 <- list()
  list_1[["positive_weight"]] <- positive_weight; list_1[["negative_weight"]] <- negative_weight
  RDS_files1[[sub("_.*.?", "", i)]] <- list_1
  
  
  # GRAPHS
  ## Loadings
  plotLoadings(tmp_pls, comp = 1, method = 'mean', contrib = 'max',
               title = paste("Loadings of the 25 principal \nmetabolites of", sub("_.*.?", "", i)))

  ## Variables (Metabolites)
  plotVar(tmp_pls, comp = c(1,2), 
        var.names = F, cex = 3, cutoff = 0.8,
        title = paste("Correlation plot of the 25 \nprincipal metabolites of", sub("_.*.?", "", i)))

  ## HeatMap
  #Execute directly in console
  X11()
  cim(tmp_pls, comp = 1,
      xlab = "Metabolites", ylab = "Samples", margins = c(7, 15),
      title = paste("Heatmap of", sub("_.*.?", "", i), 
                    "\nnormalized by", sub(".*/(.*?)\\-.*", "\\1", norm_selected), "method"), lwid = c(0.1, 0.9)) 
  dev.off()

  ## Individual samples per component
  my_colors <- c("steelblue3", "green3")
  plotIndiv(tmp_pls, comp = 1:2,
                group = Y, col = my_colors, style = "ggplot2", 
                title = paste("PC1 & PC2", i), 
                legend = TRUE, legend.position = "right", 
                legend.title = "Sampling condition", ellipse = TRUE, 
                ellipse.level = 0.95, centroid = FALSE)
}

#if (imputed == "Yes") {
#  saveRDS(RDS_files, file = "../../Supplementary_material/Supplementary_tables/predictions_results.rds")
#  saveRDS(RDS_files1, file = "../../Tables_TFM/loadings_pls_comparatives.rds")
#} else {
#  saveRDS(RDS_files, file = "../../Supplementary_material/Supplementary_tables/predictions_results_common.rds")
#  saveRDS(RDS_files1, file = "../../Tables_TFM/loadings_pls_comparatives_common.rds")
#}

Tables

Tables of predictions results:

Table 6 : The predicted classes for the randomly and edited samples of compA are:
comp1 comp2
area_2_3_15_S25_noOTA cA_S25 cA_S25
area_2_3_15_S17_noOTA cA_S17 cA_S17
area_2_2_15_S25_noOTA cA_S25 cA_S25
area_2_1_15_S17_noOTA cA_S17 cA_S17
area_2_2_15_S17_noOTA cA_S17 cA_S17
Table 7 : The predicted classes for the randomly and edited samples of compB are:
comp1 comp2
area_8_3_15_S25_noOTA cB_noOTA_S25 cB_noOTA_S25
area_8_3_15_S17_OTA cB_OTA_S17 cB_OTA_S17
area_8_2_15_S25_noOTA cB_noOTA_S25 cB_noOTA_S25
area_8_1_15_S17_OTA cB_OTA_S17 cB_OTA_S17
area_8_2_15_S17_OTA cB_OTA_S17 cB_OTA_S17
Table 8 : The predicted classes for the randomly and edited samples of compC are:
comp1 comp2
area_2_3_15_J17_noOTA cC_noOTA cC_noOTA
area_10_3_15_J25_OTA cC_OTA cC_OTA
area_10_1_15_J25_OTA cC_OTA cC_OTA
area_2_2_15_J17_noOTA cC_noOTA cC_noOTA
area_2_3_15_J25_noOTA cC_noOTA cC_noOTA
Table 9 : The predicted classes for the randomly and edited samples of compD are:
comp1 comp2
area_11_3_15_J25_OTA cD_OTA cD_OTA
area_11_2_15_J25_OTA cD_OTA cD_OTA
area_11_3_856_S25_noOTA cD_noOTA cD_noOTA
area_11_1_856_J25_OTA cD_OTA cD_OTA
area_11_2_856_S25_noOTA cD_noOTA cD_noOTA
Table 10 : The predicted classes for the randomly and edited samples of compE are:
comp1 comp2
area_2_3_15_S17_noOTA cE_noOTA cE_noOTA
area_8_3_15_S17_OTA cE_OTA cE_OTA
area_2_2_15_S17_noOTA cE_noOTA cE_noOTA
area_8_1_15_S17_OTA cE_OTA cE_OTA
area_8_2_15_S17_OTA cE_OTA cE_OTA

Tables of most important predictor variables:

Table 11 : The 10 most important predictor variables of compA are:
cA_S17 cA_S25
mz_104.10667_&_RT_0.917 Glycerol
mz_113.0342_&_RT_1.044 Choline
Acetyl-L-carnitine (R)-Coclaurine
2-Hydrazino-4-methoxy-6-(4-morpholinyl)-1,3,5-triazine NP-020156
C10H21N2O6P TO0110000
mz_185.10165_&_RT_0.95 mz_112.8955_&_RT_0.753
mz_112.89545_&_RT_0.89 Styrene
Neuraminicacid mz_185.10163_&_RT_0.815
mz_84.95949_&_RT_0.86 Adenosine
Deca-5,7,9-triynol N-({2-[(Dimethylcarbamoyl)amino]ethyl}sulfamoyl)-N-methyl-beta-alanine
Table 12 : The 10 most important predictor variables of compB are:
cB_noOTA_S25 cB_OTA_S17
Bis(4-ethylbenzylidene)sorbitol 1,5-Pentanedithiol
S-Isopropyl thiosenecioate Styrene
Nicotinicacid Nipam
D-Proline mz_124.03888_&_RT_0.894
mz_112.89548_&_RT_0.793 MFCD00053868
1-phenyl-1-butyne-3-ene mz_113.03412_&_RT_0.893
mz_185.10167_&_RT_0.852 melamine
(1S,2S,3R)-4-hydroxymethylcyclopent-4-ene-1,2,3-triol (R)-Coclaurine
N-(2-Amino-2-oxoethyl)-N’-cyclododecylsuccinamide mz_112.89546_&_RT_0.73
1_8-Diazacyclotetradecane-2_9-dione [FAamino(5:0)]2S-amino-pentanoicacid
Table 13 : The 10 most important predictor variables of compC are:
cC_OTA cC_noOTA
mz_112.0279_&_RT_17.601 D-Proline
MFCD00053868 mz_113.03416_&_RT_1.008
mz_113.03411_&_RT_0.916 S-Isopropyl thiosenecioate
1,5-Pentanedithiol Nicotinicacid
mz_124.03892_&_RT_0.916 Styrene
3,5,6,7-Tetrahydrotetrazolo[1,5-b][1,2,4]triazine 5’-deoxyguanosine
mz_112.89545_&_RT_0.738 mz_112.89547_&_RT_0.836
mz_128.07011_&_RT_0.913 Naphthalene
1,3-BIS(METHYLTHIO)-2-PROPANOL Imidazoleaceticacid
C12H23N4O2P sn-glycero-3-Phosphocholine
Table 14 : The 10 most important predictor variables of compD are:
cD_OTA cD_noOTA
(R)-Lactaldehyde mz_57.03375_&_RT_0.96
mz_57.03374_&_RT_0.876 mz_112.02786_&_RT_17.614
mz_112.02788_&_RT_17.698 mz_204.12232_&_RT_0.914
1,5-Pentanedithiol 3-[(5-Amino-6-chloro-4-pyrimidinyl)amino]-2-nonanol
mz_113.03416_&_RT_0.989 mz_113.03417_&_RT_0.923
mz_128.07015_&_RT_0.987 C2H3N4P
mz_112.89547_&_RT_0.814 (R)-Coclaurine
Naphthalene mz_112.89545_&_RT_0.739
mz_185.10164_&_RT_0.875 mz_185.1016_&_RT_0.801
C12H23N4O2P mz_84.95948_&_RT_0.707
Table 15 : The 10 most important predictor variables of compE are:
cE_noOTA cE_OTA
mz_104.10667_&_RT_0.917 Choline
Xanthurenic acid Cholinesulfate
D-Proline (S,S)-Anacine
S-Isopropyl thiosenecioate Styrene
Hypoxanthine 1,5-Pentanedithiol
mz_113.0342_&_RT_1.044 Nipam
(S)-Carnitine mz_124.03888_&_RT_0.894
Nicotinicacid MFCD00053868
C12H23N4O2P Deoxyvalidamine
2-Hydrazino-4-methoxy-6-(4-morpholinyl)-1,3,5-triazine mz_113.03412_&_RT_0.893

HeatMap

HeatMaps from the pls data. To see more information run the code in the console and consulte the graph.

HeatMap comparative A
HeatMap comparative A
HeatMap comparative B
HeatMap comparative B
HeatMap comparative C
HeatMap comparative C
HeatMap comparative D
HeatMap comparative D
HeatMap comparative E
HeatMap comparative E

HeatMap of PLS-DA (all metabolites)

HeatMap comparative A of all metabolites
HeatMap comparative A of all metabolites
HeatMap comparative B of all metabolites
HeatMap comparative B of all metabolites
HeatMap comparative C of all metabolites
HeatMap comparative C of all metabolites

HeatMap comparative D of all metabolites HeatMap comparative E of all metabolites

## ########  ##   ##  ######     ######  ##    ##  ######
##    ##     ##   ##  ##         ##      ###   ##  ##   ##
##    ##     ##   ##  ##         ##      ####  ##  ##    ##
##    ##     #######  ######     ######  ## ## ##  ##     ##
##    ##     ##   ##  ##         ##      ##  ####  ##    ##
##    ##     ##   ##  ##         ##      ##   ###  ##   ##
##    ##     ##   ##  ######     ######  ##    ##  ######
## Script completed successfully!